Abstract

This paper discusses the acceleration of algorithms using a field
programmable gate array (FPGA) design as a prototype of a static dataflow
architecture. The architecture was implemented with operators
interconnected by parallel buses. Accelerating algorithms with a dataflow
graph in a reconfigurable system shows potential for high computation
rates. The results of benchmarks implemented on the static dataflow
architecture are reported at the end of the paper.

Keywords: Accelerating algorithms, Reconfigurable Computing, Static
Dataflow Graph, Modules C to VHDL.



1. Introduction 

With the advent of reconfigurable computing, chiefly based on the Field
Programmable Gate Array (FPGA), researchers are trying to exploit the full
capabilities of these devices: flexibility, parallelism, and optimization
for power, security and real-time applications [7].



Because of the complexity of the applications and the wide range of
possibilities for developing systems with FPGAs, converting algorithms
written in a high-level language such as C or Java into hardware for these
devices, associated with a General Purpose Processor (GPP), is one of the
challenges for researchers nowadays, especially for accelerating
algorithms [9].

The main aim of this project was to accelerate algorithms by converting
parts of programs written in C into a static dataflow model implemented
in an FPGA.

This paper is organized as follows. Related work is described in Section 2.
The Dataflow Graph Model is discussed in Section 3. In Section 4 the
benchmarks implemented in the dataflow graph are presented. Section 5
shows the results of the implementations. Section 6 concludes the paper and
suggests future work.



2. Related Work 

The dataflow graph model and its architecture were first researched in the
1970s, and this research was discontinued in the 1990s [13, 15]. Nowadays,
it is a topic of research once more, mainly because of advances in
technology, particularly the advent of the FPGA [14].



Because the dataflow model has implicit parallelism and the FPGA is
composed of parallel circuits, the dataflow model applied to an FPGA is a
natural combination for executing applications that are themselves
parallel [13]. However, as applications become more complex, software
development is only feasible in a high-level language such as C or Java,
even though only parts of the program will be executed directly in
hardware. Thus, several tools have been developed to convert C into
hardware described in VHDL [11, 12].



In order to analyze data dependences, many of these systems generate an
intermediate dataflow graph for pipelining instructions. Optimizations
such as loop unrolling are then applied, and finally a reconfigurable
hardware description in VHDL is generated. The hardware generated by these
tools consists of coarse-grain elements or of assembler instructions for a
customized processor such as PicoBlaze or Nios, from Xilinx and Altera
respectively [2].

In our approach, algorithms are accelerated with fine-grain instructions
described in VHDL that implement a static dataflow architecture, consisting
of processing-element nodes and of arcs that connect those nodes in a
graph.



3. The Dataflow Graph Model 

In the Asynchronous Dataflow Graph project developed by Teifel et al.
[14], the asynchronous system is a collection of concurrent hardware
processes that communicate with each other through message-passing
channels. These messages consist of atomic data items called tokens. Each
process can send and receive tokens to and from its environment through
communication ports. In the Teifel project, asynchronous pipelines are
constructed by connecting these ports to each other using channels, where
each channel is allowed only one sender and one receiver. Since there is no
clock in an asynchronous design, processes use handshake protocols to send
and receive tokens via channels.

In Fig. 1, Teifel describes an equation converted into a dataflow graph
in three different situations: (a) a pure dataflow graph, (b) a token-based
asynchronous dataflow pipeline and (c) a clocked dataflow pipeline.



Fig. 1: Computation of y_n = y_{n-1} + c(a + b): (a) pure dataflow graph,
(b) token-based asynchronous dataflow pipeline (filled circles indicate
tokens, empty circles indicate an absence of tokens), and (c) clocked
dataflow pipeline [14].
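
As a purely illustrative software reading of the computation in Fig. 1,
the recurrence can be unrolled over a stream of input tokens. The function
name run_recurrence, the initial state y0 and the use of one constant c per
run are assumptions made for this sketch; they are not taken from [14].

```python
def run_recurrence(a_stream, b_stream, c, y0=0):
    """Evaluate y_n = y_{n-1} + c * (a_n + b_n) for each pair of input tokens."""
    y = y0                       # state token fed back to the final adder
    outputs = []
    for a, b in zip(a_stream, b_stream):
        y = y + c * (a + b)      # one pass through the add, multiply, add chain
        outputs.append(y)
    return outputs


# Hypothetical input tokens, chosen only to exercise the recurrence.
ys = run_recurrence([1, 2, 3], [4, 5, 6], c=2)
```

Each loop iteration corresponds to one set of tokens flowing through the
graph, with y playing the role of the state that is fed back.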



Our project also uses a collection of concurrent hardware processes that
communicate with each other, but through a parallel bus, with bits for data
and bits to control the communication, in a synchronous system of
communication as described in part (c) of Fig. 1.

3.1. Dataflow Computations 

In this project, the traditional dataflow model described in the
literature is used, in which a node is a processing element and an arc is
the connection between two elements [1, 13, 15]. A data bus and a control
bus were implemented to carry out the communication between the operators.
The static dataflow graph model, in which only one item of data can be on
an arc at a time, was developed.

In Fig. 2, a basic operator and its data buses and control buses for
communication are described. The data signals a, b and z in Fig. 2 are
16-bit data traveling through the parallel buses. The signals stra, strb,
strz, acka, ackb and ackz are 1-bit control signals that control
communication between operators.




Fig. 2: The basic operator with its data buses and control buses. 




Fig. 3: The communication: a) enabling the communication, b) sending an
item of data, c) acknowledging an item of data.

The communication between operators is described in Fig. 3. As can be
seen in the figure, a sender operator and a receiver operator each have two
input data buses a and b, one output data bus z and their respective
control signals stra, strb, strz, acka, ackb and ackz. Each input data bus
and output data bus is connected to a register that stores an item of data
being received or sent, represented by the rectangles with rounded edges a,
b and z in the figure. The output data bus z of the sender operator is
connected to the input data bus a of the receiver operator, the output
control signal strz of the sender operator is connected to the input
control signal stra of the receiver operator, and the input control signal
ackz of the sender operator is connected to the output control signal acka
of the receiver operator.

A "logic-0" in the signal ackz informs the sender operator that the
receiver operator is ready to receive data. A "logic-1" in the signal ackz
informs the sender operator that the receiver operator is busy. A "logic-1"
in the signal stra informs the receiver operator that an item of data is
ready to be sent to it from the sender operator. A "logic-0" in the signal
stra informs the receiver operator that the sender does not have an item of
data to send to it.

To initiate the communication, an enable signal with a "logic-0" is set on
the ackz input connected to the sender (Fig. 3a). When the receiver
operator is ready to receive data, a "logic-1" in stra strobes an item of
data into the input data bus a of the receiver operator (Fig. 3b).
Consequently, a "logic-0" in acka acknowledges that the item of data a was
received (Fig. 3c).
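
The strobe and acknowledge rules above can be mimicked in software with a
one-slot input port. This is only a behavioural sketch, not the VHDL of the
paper: the class name InputPort and its methods are hypothetical, and it
assumes that ack is held at "logic-1" (busy) for as long as the input
register is occupied, which the text does not state explicitly.

```python
class InputPort:
    """One input data bus of a receiver operator, with its str/ack signals."""

    def __init__(self):
        self.data = None   # register behind the input data bus
        self.ack = 0       # logic-0: ready to receive (the enable state)

    def offer(self, str_bit, value):
        """Sender drives str and data; returns True if the item was latched."""
        if str_bit == 1 and self.ack == 0:
            self.data = value
            self.ack = 1   # assumption: busy while the register is full
            return True
        return False       # no strobe, or receiver busy: nothing is transferred

    def consume(self):
        """Receiver uses the stored item and becomes ready again."""
        value, self.data = self.data, None
        self.ack = 0
        return value


port = InputPort()
first_accepted = port.offer(1, 42)    # strobe while the port is ready
second_accepted = port.offer(1, 43)   # strobe while the port is busy
```

Because the port holds a single register, a second item cannot be accepted
until the first is consumed, which matches the static model in which only
one item of data can be on an arc.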

3.2. The Dataflow Operators 

The dataflow operators are the traditional operators described by Veen
[15], which are: copy, non-deterministic merge, deterministic merge,
branch, conditional and primitive operators (add, sub, mul, div, and, or,
not, etc.).

In order to execute the computation of an operator, an item of data must
be present on all of its input data buses. In Fig. 4, the operators are
shown before and after computation, where filled circles indicate items of
data and empty circles indicate an absence of items of data [14].



The functional execution of dataflow operators is described below: 

1. Copy: This dataflow node duplicates an item of data to two receiver 
operators. It receives an item of data in its input data bus and copies 
the item of data to two output data buses. 

2. Primitive: This dataflow node receives two items of data on its input
data buses, computes the primitive operation on these two items of data
and sends the result to the output data bus. Operators such as add, sub,
multiply, divide, and, or, not, if, etc., are implemented in the same way.

3. Dmerge: This dataflow node performs a two-way controlled data merge
and allows an item of data to be conditionally read from the input data
buses. It receives a TRUE/FALSE item of data to decide which input data, a
or b respectively, to send to the output data z.

4. NDmerge: This dataflow node performs a two-way uncontrolled data merge
and allows an item of data to be read from either input data bus. The first
item of data to arrive at the NDmerge operator from input a or b is sent to
the output data z.

5. Branch: This dataflow node performs a two-way controlled data branch
and allows the item of data to be conditionally sent on to two different
output buses. It receives a control TRUE/FALSE item of data to decide to
which output data, t or f respectively, the input data a is transferred.
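
The five operators listed above can be summarised as small functions on
token values. This is a minimal sketch of their functional behaviour only,
with no buses or handshaking. The function names are hypothetical, None
stands for the absence of an item of data, and the tie-break in ndmerge
(input a wins when both inputs are present) is an assumption, since in
hardware the choice depends on arrival time.

```python
import operator


def copy_op(a):
    """Copy: duplicate one item of data onto two outputs."""
    return a, a


def primitive(op, a, b):
    """Primitive: apply a two-input operation such as add or sub."""
    return op(a, b)


def dmerge(ctrl, a, b):
    """Dmerge: TRUE selects input a, FALSE selects input b."""
    return a if ctrl else b


def ndmerge(a, b):
    """NDmerge: forward whichever input holds an item (a wins a tie here)."""
    return a if a is not None else b


def branch(ctrl, a):
    """Branch: route a to output t on TRUE, to output f on FALSE."""
    return (a, None) if ctrl else (None, a)


total = primitive(operator.add, 2, 5)
t_out, f_out = branch(True, total)
```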

3.2.1. The Basic Dataflow Operator Architecture 

A register-transfer-level (RTL) datapath diagram for a sum (ADD) operator
is given in Fig. 5. In the figure, the 1-bit registers bita and bitb are
used to inform the ADD operator when the 16-bit registers dadoa and dadob,
respectively, are filled with an item of data.

A "logic-1" in bita or bitb informs the ADD operator that there is an
item of data within dadoa or dadob, respectively. A "logic-0" in bita or
bitb informs the ADD operator that dadoa or dadob is empty.

When both items of data are in the receiver operator, the ADD operation
is executed and the result is stored in the 16-bit register dadoz. The
1-bit register bitz receives a "logic-1" to indicate that there is an item
of data to send to the next operator (the signal strz).

Fig. 5: Datapath (RTL) Diagram of ADD Operator.



The operation of the ADD operator is described in the ASM chart in
Fig. 6. The chart has four states: S0, S1, S2 and S3. The initial state S0
initializes several signals of the operation process. In state S1, an item
of data from either input data bus can be received by the operator, the
corresponding status bit is set and, simultaneously, the acknowledge signal
is set. After all the items of data have been received, the function of the
operator is executed in state S2. Finally, in state S3, several signals of
the operation process are set to "logic-0" so that the operator can
continue its execution.
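
The four states can be sketched as a small clocked state machine in
software. This is an illustrative model under stated assumptions, not the
authors' VHDL: the class name AddOperator is hypothetical, S3 is assumed to
return to S1, the sum is assumed to wrap to 16 bits, and the output
handshake with the next operator (strz/ackz) is left out.

```python
class AddOperator:
    """Behavioural model of the ADD operator; one step() call per clock."""

    def __init__(self):
        self.state = "S0"
        self._clear()

    def _clear(self):
        self.bita = self.bitb = self.bitz = 0      # 1-bit status registers
        self.acka = self.ackb = 0                  # acknowledge outputs
        self.dadoa = self.dadob = self.dadoz = 0   # 16-bit data registers

    def step(self, stra=0, a=0, strb=0, b=0):
        if self.state == "S0":                     # initialize signals
            self._clear()
            self.state = "S1"
        elif self.state == "S1":                   # receive items of data
            if stra and not self.bita:
                self.dadoa, self.bita, self.acka = a & 0xFFFF, 1, 1
            if strb and not self.bitb:
                self.dadob, self.bitb, self.ackb = b & 0xFFFF, 1, 1
            if self.bita and self.bitb:
                self.state = "S2"
        elif self.state == "S2":                   # execute the function
            self.dadoz = (self.dadoa + self.dadob) & 0xFFFF  # assumed wrap
            self.bitz = 1
            self.state = "S3"
        elif self.state == "S3":                   # clear and continue
            self.bita = self.bitb = self.bitz = 0
            self.acka = self.ackb = 0
            self.state = "S1"                      # assumption: back to S1


op = AddOperator()
op.step()                  # S0 -> S1
op.step(stra=1, a=3)       # item a arrives
op.step(strb=1, b=4)       # item b arrives, S1 -> S2
op.step()                  # S2: compute, S2 -> S3
```

At this point op.dadoz holds the sum and op.bitz is set; one further step
clears the status bits and returns the operator to S1.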

Each operator contains a Finite State Machine (FSM) that controls each
step of the execution and the communication between operators.

Although there is a clock (signal CLK in Fig. 5), communication between
operators is asynchronous in the sense that the moment at which data will
be sent to the next operator is unpredictable.

There are three different operator architectures. The first, already
described in Fig. 5, has two input data buses and one output data bus. This
is the case for the primitive operators ADD, SUB, MUL and DIV; the
relational operators IFgt, IFge, IFlt, IFle, IFeq and IFdf; the logic
operators AND, OR and NOT; and the control operator NDmerge. The second is
the control operator Dmerge, with three input data buses and one output
data bus. The third is the control operator Branch, with two input data
buses and two output data buses.

Fig. 6: ASM Chart of ADD Operator.



4. The Benchmarks Implemented in the Dataflow Model 

The benchmarks implemented in the dataflow model were: Fibonacci, Max,
Dot prod, Vector sum, Bubble sort, and Pop count [10].

To convert the benchmark algorithms into VHDL, each benchmark was
described as a dataflow graph, and then an assembler language was used to
convert the dataflow graph into VHDL. The Fibonacci algorithm is described
here to illustrate the process of converting an algorithm into VHDL; the
other algorithms were processed in the same way. The Fibonacci algorithm is
given in Algorithm 1 and its dataflow graph in Fig. 7.

Algorithm 1 Calculate Fibonacci
  first ← 0
  second ← 1
  tmp ← 0
  for i = 0 to n do
    tmp ← first + second
    first ← second
    second ← tmp
  end for
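
For reference, the loop of Algorithm 1 can be written as a short software
model against which a hardware result could be compared. The function name
fibonacci_loop is hypothetical. The sketch assumes that first, tmp and the
loop index start at 0 and that the upper bound n is included in the loop;
neither is certain from the text.

```python
def fibonacci_loop(n):
    """Software model of Algorithm 1; returns the final value of second."""
    first, second, tmp = 0, 1, 0      # assumed initial values
    for i in range(0, n + 1):         # assumption: "to n" includes n
        tmp = first + second
        first = second
        second = tmp
    return second


result = fibonacci_loop(5)
```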




Fig. 7: The Fibonacci algorithm described as a dataflow graph.



As can be seen in Fig. 7, there are two parts in the dataflow graph: the
part on the left side of the figure controls the loop with index i, and the
part on the right side implements the Fibonacci sequence.

As the dataflow graph consists of nodes and arcs, each node represents an
operator and each arc represents the communication between two operators.
In Fig. 7, a label is assigned to each arc in the dataflow graph. As arcs
represent the communication between two operators, each label stands for
both the parallel data bus for items of data and the control bus that
controls the communication. The assembler language uses the name of the
operator and the labels of its arcs to convert the dataflow graph into
VHDL. The assembler language for the Fibonacci dataflow graph is given in
Listing 1.

As can be seen in Listing 1, the node operators and their input and
output arcs are listed. Labels that connect node operators begin with the
character s followed by a number; the other labels are input or output data
signals. The labels dadoa, dadob, dadoc, dadod, dadoe, dadof, dadog, dadoh,
dadoi and dadoj are input data signals used to initialize data for the
Fibonacci dataflow graph, and the labels pf and fibo are output data
signals that report the result of the Fibonacci sequence. Specifically, for
the Fibonacci sequence, dadoa receives the Fibonacci argument n and fibo is
the result for that argument.

Listing 1: The Assembler Language for Fibonacci Dataflow Graph

ndmerge s7, dadob, s1;
dmerge s2, dadoc, s1, s3;
ndmerge dadod, s11, s2;
gtdecider dadoa, s4, s5;
copy s3, s4, s9;
copy s5, s6, s8;
branch s9, s8, s10, pf;
copy s6, s7, s12;
add s10, dadoe, s11;
ndmerge s17, dadof, s13;
ndmerge dadog, s25, s14;
ndmerge dadoi, s22, s23;
ndmerge dadoj, s19, s21;
copy s18, s19, s20;
dmerge s23, dadoh, s12, s24;
dmerge s20, s21, s26, s22;
copy s24, s25, s26;
add s13, s14, s15;
copy s15, s16, s18;
copy s16, s17, fibo;
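
One way to see how such a listing describes a graph is to parse it into
nodes and check that every internal arc has exactly one producer and one
consumer. The sketch below is not the authors' assembler. It assumes an
operand layout the text does not state: inputs come first and outputs last,
with two outputs for copy and branch and one for every other operator. The
names parse, arc_usage and N_OUTPUTS are hypothetical.

```python
from collections import Counter

LISTING = """
ndmerge s7, dadob, s1;
dmerge s2, dadoc, s1, s3;
ndmerge dadod, s11, s2;
gtdecider dadoa, s4, s5;
copy s3, s4, s9;
copy s5, s6, s8;
branch s9, s8, s10, pf;
copy s6, s7, s12;
add s10, dadoe, s11;
ndmerge s17, dadof, s13;
ndmerge dadog, s25, s14;
ndmerge dadoi, s22, s23;
ndmerge dadoj, s19, s21;
copy s18, s19, s20;
dmerge s23, dadoh, s12, s24;
dmerge s20, s21, s26, s22;
copy s24, s25, s26;
add s13, s14, s15;
copy s15, s16, s18;
copy s16, s17, fibo;
"""

# Assumed operand layout: inputs first, outputs last.
N_OUTPUTS = {"copy": 2, "branch": 2}   # every other operator: one output


def parse(listing):
    """Return a list of (operator, input_labels, output_labels) tuples."""
    nodes = []
    for line in listing.strip().splitlines():
        name, operands = line.strip().rstrip(";").split(None, 1)
        labels = [label.strip() for label in operands.split(",")]
        n_out = N_OUTPUTS.get(name, 1)
        nodes.append((name, labels[:-n_out], labels[-n_out:]))
    return nodes


def arc_usage(nodes):
    """Count how often each label is produced and consumed."""
    produced = Counter(l for _, _, outs in nodes for l in outs)
    consumed = Counter(l for _, ins, _ in nodes for l in ins)
    return produced, consumed


nodes = parse(LISTING)
produced, consumed = arc_usage(nodes)
internal = sorted(l for l in produced if l in consumed)
graph_inputs = sorted(l for l in consumed if l not in produced)
graph_outputs = sorted(l for l in produced if l not in consumed)
well_formed = all(produced[l] == 1 and consumed[l] == 1 for l in internal)
```

Under this operand layout, the labels that are only consumed are the dado
initialization signals and the labels that are only produced are pf and
fibo, which agrees with the description of the listing.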



5. Experimental Results 

The benchmarks were implemented on a Virtex FPGA (7v285tffg1157-3) from
Xilinx and synthesized in ISE 13.1. The results were compared with the same
benchmarks implemented in C-to-Verilog and LALP, described in [10], which
were implemented on a Stratix FPGA (EP1S10F780C6) from Altera and
synthesized in Quartus II 6.1.

Table 1 lists the results of the implementations of each benchmark in
C-to-Verilog, LALP and Acceleration Algorithms. Fig. 8 summarizes these
results.

Table 1: The results of implementation for Benchmarks

System                 Benchmark    FF     LUT    Slices  Max freq.
C-to-Verilog           Bubble sort  2353   2471   971     239.45
                       Dot prod     758    578    285     249.36
                       Fibonacci    73     108    69      297.81
                       Max vector   496    392    164     435.9
                       Pop count    1023   872    384     411.22
                       Vector sum   177    113    34      546.538
LALP                   Bubble sort  219    105    79      353.16
                       Dot prod     97     69     32      213.14
                       Fibonacci    104    41     30      505.08
                       Max vector   50     39     20      484.97
                       Pop count    350    215    115     503.73
                       Vector sum
Algorithm Accelerator  Bubble sort  85     485    712     613.685
                       Dot prod     323    362    542     613.685
                       Fibonacci    72     482    755     612.108
                       Max vector   80     425    598     613.685
                       Pop count    79     453    684     613.685
                       Vector sum   52     284    419     613.685

As can be seen in Fig. 8, the Acceleration Algorithms occupy fewer
flip-flops (FF) than the C-to-Verilog system, but more than the LALP
system, for all the benchmarks. For LUT occupancy, the Acceleration
Algorithms occupy fewer LUTs than the C-to-Verilog system, except for the
Fibonacci, Max and Vector sum benchmarks, but more than the LALP system for
all the benchmarks. For slice occupancy, the Acceleration Algorithms occupy
more slices than the C-to-Verilog and LALP systems (except for the Bubble
sort benchmark). Finally, for maximum frequency, the Acceleration
Algorithms reached a higher frequency than the other two systems.

Fig. 8: Comparing the benchmarks.
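
To make the frequency comparison concrete, the maximum frequencies of
Table 1 can be turned into ratios. The snippet below is only a worked
arithmetic aid. The names MAX_FREQ and frequency_ratios are hypothetical,
the values are copied from Table 1 in the table's own units, and the LALP
entry for Vector sum is left as None because no value is available for it.

```python
# (C-to-Verilog, LALP, Algorithm Accelerator) maximum frequency per benchmark
MAX_FREQ = {
    "Bubble sort": (239.45, 353.16, 613.685),
    "Dot prod": (249.36, 213.14, 613.685),
    "Fibonacci": (297.81, 505.08, 612.108),
    "Max vector": (435.9, 484.97, 613.685),
    "Pop count": (411.22, 503.73, 613.685),
    "Vector sum": (546.538, None, 613.685),
}


def frequency_ratios(table):
    """Ratio of the accelerator frequency to each of the other two systems."""
    ratios = {}
    for name, (c2v, lalp, acc) in table.items():
        vs_lalp = acc / lalp if lalp is not None else None
        ratios[name] = (acc / c2v, vs_lalp)
    return ratios


ratios = frequency_ratios(MAX_FREQ)
for name, (vs_c2v, vs_lalp) in ratios.items():
    lalp_text = "n/a" if vs_lalp is None else f"{vs_lalp:.2f}"
    print(f"{name}: x{vs_c2v:.2f} vs C-to-Verilog, x{lalp_text} vs LALP")
```

Every available ratio is greater than one, which is the numerical form of
the statement that the dataflow implementation reached the highest maximum
frequency.
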


6. Conclusion and Future Work

By and large, the Accelerating Algorithms occupy more space within the
FPGA than the C-to-Verilog and LALP systems. However, they reach a higher
maximum frequency than the other two systems, although the main aim of this
project was to validate the implementation model used to convert algorithms
into a dataflow graph and then into VHDL. Taking this into account,
accelerating algorithms become one more solution for parallelism in FPGAs.
The benchmarks used in this paper basically perform operations on vectors,
but it is very important to explore the maximum parallelism of the dataflow
graph using real parallel applications. Future work includes developing a
module to convert C directly into VHDL, associated with the FPGA, and
implementing a dynamic dataflow model to obtain better performance than the
static model implemented in this paper.